The dividing lines between system buses, system 
intraconnects, and system interconnects are getting more 
blurry all the time. And that is, oddly enough, going to turn 
out to be a good thing in the long run. 


In the short run, this is a bit messy. There are a number of 
competing and complementary standards that span this 
middle ground between the processor and adjacent systems, 
many of which run atop the PCI-Express bus transport but 
which do more interesting things with it than just hanging 
storage or networking off the bus, such as doing some form 
of memory sharing across devices, usually through some sort 
of coherency mechanism. Others are coming up with their 
own electrical or optical signaling. 


These include the Compute Express Link (CXL) from Intel, 
the Coherent Accelerator Processor Interface (CAPI) from IBM, the 
Cache Coherent Interconnect for Accelerators (CCIX) from 
Xilinx, and the Infinity Fabric from AMD. Other 
interconnects try to get around some of the limitations of 
bandwidth or latency inherent in the PCI-Express bus, such 
as the NVLink interconnect from Nvidia and the OpenCAPI 
interconnect from IBM. OpenCAPI, which is supported on Big 
Blue’s Power9 processors, relies on special SERDES 
communication units on the chip that run at 25 Gb/sec and 
that can support a variant of the CAPI protocol or the NVLink 
protocol to attach Power9s to Nvidia Tesla GPU accelerators 
that also support NVLink — and do so in a coherent fashion 
across these different devices. The Gen-Z interconnect from 
Hewlett Packard Enterprise links out from PCI-Express on 
servers to silicon photonics bridges and switches that hold 
out the promise of a memory-centric, rather than compute-centric, 
architecture for systems. It can be used to hook 
anything from DRAM to flash to accelerators in meshes with 
any manner of CPU. 


Eating The Interconnect Alphabet Soup With Intel’s CXL https://www.nextplatform.com/2019/09/18/eating-the-interconnect-alpha... 

At this point, all of these interconnects but Nvidia’s NVLink 
and AMD's Infinity Fabric have an independent consortium 
driving their specifications, and more than a few hyperscalers 
and vendors participate in multiple consortia to keep a hand 
in all of the different games. At some point, these may resolve 
into a smaller set of transports and protocols that achieve the 
collective goals of these interconnects. They may compete to 
the death. But it sure doesn't look like it, not with Steve Fields, 
chief engineer of Power Systems at IBM who also spearheads 
OpenCAPI, and Gaurav Singh, corporate vice president at 
Xilinx and who spearheads CCIX, plus Dong Wei, standards 
architect at ARM Holdings and Nathan Kalyanasundharam, 
senior fellow at AMD, being four of the five members of the 
board of the new CXL Consortium, which was launched this 
week. Alibaba, Cisco Systems, Dell EMC, Facebook, Google, 
Hewlett Packard Enterprise, Huawei Technology, and 
Microsoft all jumped on the CXL bandwagon early, and 
together, these companies represent a big portion of the 
systems ecosystem when gauged by capacity sold or bought. 
Significantly, Nvidia has also joined up even though it does 
not have a seat on the CXL board. 


The only problem that we see initially with CXL, which was 
shown off in detail at the recent Hot Interconnects 
conference, is that it is tied to the PCI-Express 5.0 protocol, 
which is not yet available. PCI-Express 4.0, which came out in 
2017, is still only available with two processors — IBM’s 
Power9 and AMD's “Rome” Epyc 7002 — and while we are all 
excited that the PCI-Express 5.0 spec is coming out sometime 
this year and PCI-Express 6.0 is expected to be ratified in 
2021, it has taken far too long to get these faster buses into 
new chips. 


(Hmmm. It is a pity that the I/O is not all in a central hub in a 
chiplet architecture that could swap the I/O out without 
messing up the cores ... Oh wait, it is already with AMD's 
Rome and will very likely be with IBM’s Power10, which 
definitely supports PCI-Express 5.0 controllers and will 
almost certainly have a chiplet architecture. Intel itself doesn't 
expect to get products out the door supporting PCI-Express 
5.0 until 2021.) 


System builders and system buyers want to be able to have 
some kind of fast links and coherence between CPUs and 
various kinds of accelerators and storage class memories — 
and they want it yesterday, which is how we ended up in this 
alphabet soup in the first place. 


Stephen Van Doren, an Intel Fellow and director of processor 
interconnect architecture at the chip maker, walked the 
bitheads at Hot Interconnects through the CXL architecture 
and talked about many of its finer points, but said that even 
though CXL would be aligning with the 32 Gb/sec PCI-Express 
5.0 protocol, which is double what PCI-Express 4.0 
delivers, CXL would also be “a key driver for an aggressive 
timeline” to PCI-Express 6.0, which will double up the 
transfer rate one more time to 64 Gb/sec. That is eight times 
the bandwidth of the standard PCI-Express 3.0 link. We 
think that PCI-Express is still the main bottleneck in 
systems, and anything that can be done to accelerate more 
bandwidth and more interesting connectivity over this bus is 
much welcomed. 
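
The doubling from one generation to the next is easy to check with a few lines of arithmetic. The sketch below uses the per-lane rates quoted above for PCI-Express 4.0 through 6.0, plus 8 Gb/sec for PCI-Express 3.0, which follows from the eight-fold comparison. It ignores encoding and protocol overhead, and the x16 link width and the function names are assumptions made only for illustration.

```python
# Raw per-lane signaling rates in Gb/sec, one direction, before any overhead.
LANE_RATE_GBPS = {"3.0": 8, "4.0": 16, "5.0": 32, "6.0": 64}


def raw_link_gbytes_per_sec(generation: str, lanes: int = 16) -> float:
    """Raw one-way bandwidth of a link in GB/sec (8 bits per byte, no overhead)."""
    return LANE_RATE_GBPS[generation] * lanes / 8


def speedup(newer: str, older: str) -> float:
    """Ratio of per-lane rates between two PCI-Express generations."""
    return LANE_RATE_GBPS[newer] / LANE_RATE_GBPS[older]


if __name__ == "__main__":
    for gen in LANE_RATE_GBPS:
        print(f"PCI-Express {gen}: {raw_link_gbytes_per_sec(gen):.0f} GB/sec raw on x16")
    print("6.0 versus 3.0:", speedup("6.0", "3.0"))
```

Real links deliver somewhat less than these raw figures once encoding and packet overhead are taken out.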


As we said above, CXL is Intel’s own twist on PCI-Express 
that will support both I/O and memory 
disaggregation (a kind of Holy Grail for system architects 
that essentially virtualizes motherboards to make compute, 
memory, and I/O malleable across clusters of components) 
as well as computational offload to devices such as GPU and 
FPGA accelerators, memory buffers, and other kinds 
of devices such as SmartNICs, which are computers in their 
own right. CXL is a set of sub-protocols that ride on the PCI-Express 
bus on a single link and give it some new tricks. Take 
a look: 










CXL.io is the easy one, and it is basically the PCI-Express 
transaction layer that is reformatted to allow for the other 
two sub-protocols to co-exist side by side with it. CXL.io is 
used to discover devices in systems, manage interrupts, give 
access to registers, handle initialization, deal with signaling 
errors, and such. The CXL.cache sub-protocol allows for an 
accelerator in a system to access the CPU’s DRAM, and 
CXL.memory allows for the CPU to access the memory 
(whatever kind it is) in an accelerator (whatever kind of 
processing engine it is). 


“These three protocols are not necessarily required to be used 
in all configs,” explained Van Doren. “In fact, protocol 
subsetting is expected as part of the CXL ecosystem. And 
there’s basically three usage templates that track the relevant 
subset of usages that we expect to see.” 





The first subset is called a Type 1 device in the CXL 
nomenclature, and it is for devices that want to cache data 
from the CPU main memory locally. In this case, the devices 
only have to employ the CXL.io and CXL.cache layers. With a 
Type 2 device, there is memory on the accelerator and you 
want an interplay between the CPU and the accelerator, so the 
CXL.io protocol is used to allow the CPU to discover the 
device and configure it, and then you use CXL.memory to 
allow the processor to touch the device’s memory and 
CXL.cache in the opposite direction. The Type 3 device is a 
memory buffer, and in this case you need the CXL.io sub-protocol 
to discover and configure the device and the 
CXL.memory sub-protocol to allow the CPU to reach into the 
memory attached to your memory buffer. It is interesting to 
contemplate just how much memory you could hang off that 
CXL link on the right in the picture above, and what kind of 
memory it might be and how fast a PCI-Express 6.0 link 
might be to support it. 
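
The three usage templates boil down to a small lookup table of which sub-protocols each device type has to implement. Here is a minimal sketch of that table; the class and function names are hypothetical, made up for illustration, and are not drawn from the CXL specification or any driver.

```python
from enum import Flag, auto
from typing import Optional


class SubProtocol(Flag):
    IO = auto()      # CXL.io: discovery, configuration, interrupts, errors
    CACHE = auto()   # CXL.cache: the device caches data from CPU main memory
    MEMORY = auto()  # CXL.memory: the CPU reaches memory attached to the device


# Sub-protocols each device type employs, as described in the text.
DEVICE_TYPES = {
    1: SubProtocol.IO | SubProtocol.CACHE,
    2: SubProtocol.IO | SubProtocol.CACHE | SubProtocol.MEMORY,
    3: SubProtocol.IO | SubProtocol.MEMORY,
}


def classify(protocols: SubProtocol) -> Optional[int]:
    """Return the device type whose sub-protocol set matches exactly, else None."""
    for device_type, required in DEVICE_TYPES.items():
        if protocols == required:
            return device_type
    return None
```

Every template includes CXL.io, since that is the layer used to discover and configure the device in the first place.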


Drilling down deeper than the sub-protocols to the link 
layers underneath CXL, this is where it is really different from 
the PCI-Express protocol, and obviously intentionally so. 





There are tradeoffs that have to be made in designing a link 
layer, explained Van Doren, and CXL is no exception. 


“For the CXL.io protocol, which is very PCIe-like, we actually 
do run it through something that looks very much like the 
standard PCIe link layer,” said Van Doren. “But for the 
CXL.cache and the CXL.memory protocols, we actually have a 
very different link layer and we have an interface stack that 
does the multiplexing of the protocols further down, closer to 
the PHY. These two different types of link layers are 
differentiated based on whether you have fixed framing or 
not. PCIe has dynamic framing, which is very useful when 
you want to send messages of widely varying size, anything 
from 8 byte transactions to 4 KB transactions. When you're 
looking at CXL.cache and CXL.memory, you are in a cache 
coherent and memory semantic environment where all of the 
transfers are 64 byte cache line sizes. You have some control 
messages that go along with that, but the dynamic range of 
message sizes is quite constrained compared to what you see 
on PCIe. And this creates an opportunity to build a link layer 
based on fixed framing, which can save a lot of latency.” 


Van Doren said that the consortium will have a lot of partners 
with many different kinds of devices, and latencies will vary 
according to design and device type, but based on early 
designs, fixed framing for the CXL.cache and 
CXL.memory sub-protocols delivered an order of magnitude 
(100 nanoseconds) lower latency. “So 
even though CXL adds a little bit more complexity to the 
interface stack, we think the savings in latency are worth the 
investment and complexity.” 
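
To get a feel for the difference between the two framing styles, consider a toy parser for each. The sketch below is only an illustration: the two-byte length prefix is an invented format, not the PCI-Express or CXL wire format, and all of the function names are hypothetical. With dynamic framing the receiver has to read each header before it knows where the next message starts. With fixed framing every boundary is known in advance.

```python
import struct
from typing import List

CACHE_LINE = 64  # bytes per transfer, matching the 64 byte cache lines in the quote


def frame_dynamic(messages: List[bytes]) -> bytes:
    """Toy dynamic framing: prefix each message with a 2-byte big-endian length."""
    return b"".join(struct.pack(">H", len(m)) + m for m in messages)


def parse_dynamic(stream: bytes) -> List[bytes]:
    """Walk the stream header by header; each boundary depends on the previous length."""
    messages, offset = [], 0
    while offset < len(stream):
        (length,) = struct.unpack_from(">H", stream, offset)
        offset += 2
        messages.append(stream[offset:offset + length])
        offset += length
    return messages


def parse_fixed(stream: bytes, size: int = CACHE_LINE) -> List[bytes]:
    """Toy fixed framing: every message is `size` bytes, so boundaries are known up front."""
    return [stream[i:i + size] for i in range(0, len(stream), size)]
```

The dynamic parser handles an 8 byte and a 4 KB message in the same stream, which the fixed parser cannot do, but it pays for that flexibility with a serial dependency on every header.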


The one thing that everyone wants to know about these new 
in-system interconnects is how they are going to deal with 
cache coherency and allow for memory to be shared across 
devices. There was a lot of talk a few years back, when IBM’s 
CAPI and Nvidia’s NVLink were in development and rolling 
out, that Intel would open up its QuickPath Interconnect (QPI) 
or its follow-on, Ultra Path Interconnect (UPI), which is used 
to provide NUMA links between Xeon processors so they can 
share cache and main memory and present a shared memory 
space to operating systems. 


“This is one of the aspects of CXL that I think we get the most 
questions about,” Van Doren said. “CXL has a cache coherency 
protocol that doesn’t look very much like the multi-socket 
cache coherency protocols that most folks are used to seeing. 
It doesn't look like the one Intel does, the UPI multi-socket 
protocol, and it doesn't look like those of our competitors, either.” 


The approach that Intel has taken is asymmetric, and this is 
not the first system architecture to do this, but it is the first 
one in a long time to do it. (Not to remind everyone, but IBM’s 
AS/400 systems had asymmetric multiprocessing decades 
ago, simply because the CPU was too precious of a 
commodity to burden with I/O and storage tasks. What’s old 
is new again.) 


With symmetric cache coherency protocols, which typically 
link the memories of separate CPUs to each other but which 
can also be used to link memories of accelerators to each 
other (NVLink can do this across GPU accelerators as well as 
IBM Power9 processors), the compute engines are front-ended 
by a protocol caching agent and the memories are 
front-ended by a protocol home agent, and the interconnect 
— UPI, Infinity Fabric, Bluelink, whatever — connects them all 
together using a low-latency, high bandwidth transport layer, 
as illustrated on the left of the chart below: 





In the case of an accelerator using the same cache coherency 
interconnect as the CPUs, the accelerator looks like any other 
CPU and its memory looks like any other memory block. But 
there is a drawback, said Van Doren, and that is that the cache 
coherency is going to be bottlenecked by the bandwidth 
between the CPU and the accelerator. Which means that if you 
are not going to literally put UPI on the accelerator, and use 
something like PCI-Express instead as a transport, that much 
slower interconnect is going to be a real bottleneck. 
Moreover, every server processor maker has its own NUMA 
interconnect and it is highly unlikely that they will all agree to 
adopt one of them as a standard. (Although that would be very 
convenient, especially if we had socket compatibility across 
processors, too. Imagine how wonderful that would be. ... ) 


So CXL gives up on strict CPU-style cache coherency with 
accelerators and uses an approach called biased coherency 
bypass, which is shown on the right side of the chart above. 
And in fact, there are two different biases, which are outlined 
here: 





The accelerator coherence is created with two complete — 
and completely different — flows, and importantly software 
can be used to flip back and forth between these two modes, 
which are called host bias and device bias, respectively. The 
idea is to get the benefits of cache coherency out to an offload 
engine without having to pay some of the high prices for full-on 
cache coherency as implemented between CPUs. 


“When you choose the host bias coherency protocol, 
everything from the accelerator literally has to get bounced 
through the CPU,” Van Doren explained. “The ordering point 
for even the memory lines on the accelerator is in that cache 
coherence agent inside the CPU. With the second flow, which 
we call device bias, the intuitive way to understand the way it 
works is that this flow essentially forces the CPU to interact 
with a memory as though it’s uncacheable memory. So the 
CPU can grab a copy of the data, but it’s not allowed to hold it 
in its caches for an extended period of time. What that means 
is that when my accelerator goes to access this memory, it is 
guaranteed that the CPU doesn't have a copy. So there is no 
reason to go over to the CPU and check.” 


Importantly, this is all going to be determined at the driver 
level, below the application and alongside of and in 
conjunction with the operating system. And it can happen on 
a page by page basis managed by that driver. Moreover, 
because of the asymmetric nature of the cache coherency, a 
CXL-enabled accelerator will not care one whit what 
coherency protocol is being used across the memories 
attached to the CPUs in the system. So with Intel CPUs, it can 
make use of UPI, with AMD CPUs, it can make use of Infinity 
Fabric, with Arm CPUs it can make use of CCIX, and with IBM 
CPUs it can make use of Bluelink. The other possible 
approach was to create some kind of universal symmetric 
protocol, or to try to get the symmetric protocols to bridge to 
each other. But that would be tough, and also yield a least 
common denominator compatibility with each individual 
symmetric cache coherency protocol, and we all know about 
how useful that would be in practice. 
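
The page-by-page bias switching described above can be pictured as a small table that a driver consults and updates. The sketch below is a toy model only: the class and method names are hypothetical, and the 4 KB page size and the host bias default are assumptions, not details from the CXL specification.

```python
from enum import Enum


class Bias(Enum):
    HOST = "host"
    DEVICE = "device"


class BiasTable:
    """Toy per-page bias table for device-attached memory (hypothetical model)."""

    def __init__(self, num_pages: int, page_size: int = 4096):
        self.page_size = page_size           # assumed page size
        self.bias = [Bias.HOST] * num_pages  # assumed default: host bias

    def flip(self, page: int, new_bias: Bias) -> None:
        """Driver-level switch of one page between the two modes."""
        self.bias[page] = new_bias

    def device_access_path(self, address: int) -> str:
        """Describe how the accelerator reaches its own memory at `address`."""
        page = address // self.page_size
        if self.bias[page] is Bias.HOST:
            # Host bias: the request is bounced through the CPU's coherence agent.
            return "via CPU coherence agent"
        # Device bias: the CPU holds no cached copy, so there is nothing to check.
        return "direct to device memory"
```

A driver might flip a page to device bias before handing work to the accelerator and flip it back once the CPU needs to cache the results again.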


“The asymmetry was actually built in by design to try to make 
this more of an open ecosystem,” Van Doren elaborated. “And 
I think this is truth in advertising: Intel started down this 
direction because we wanted an ecosystem that was 
amenable to both our client and server CPUs. So this isn’t just 
a question of different vendors. It turns out if you want an 
open ecosystem where you can build an accelerator and plug 
it into both the server CPU and the client CPU without putting 
a whole bunch of extra gunk on the accelerator that the client 
doesn't need, you need an interconnect that allows the server 
to have the server coherence engine, which is very big and 
scalable, and our client CPU gets to have a very lightweight, 
non-scalable coherence engine.” 


This asymmetric approach is important for another reason: It 
is going to make memory disaggregation a lot easier. With a 
symmetric approach, any adjunct memory buffer in a system 
would need to have its own home agent and snoop filter to 
bring it into the shared memory space of the collection of 
CPUs in the system. But using CXL and its asymmetric 
approach, any memory buffer can make use of the home 
agents and snoop filters on the processors and does not have to 
have these electronics added in. As Van Doren put it, the 
symmetry needlessly complicates a device class, memory 
buffers, that should be simple to deploy. 


And that brings us to the offload model and why Intel thinks 
we need cache coherence at all. With the exception of the IBM 
Power9 processors and the Nvidia Tesla V100 GPU 
accelerators, which have NVLink ports linking them all into a 
shared memory space, we don't really have coherence 
between CPUs and accelerators, but Intel thinks that we need it, 
and CXL is about delivering that. 


There has been an evolution in how the memories of CPUs 
and accelerators interact, as shown below: 

In the beginning, which Van Doren calls the split physical 
address with I/O link, the accelerator was on the PCI-Express 
bus but there were two distinct virtual address spaces: one 
for the CPUs and one for each individual accelerator. 


“If you had any data structures that were pointer-based, there 
was a lot of complexity and you would have to do data 
marshalling, which means take your pointer-based data 
structure, smash it down to an array-based data structure, 
copy it over, re-expand it. For every application, your 
developers would have to write these things. You could get 
some efficient data copies — if you looked at your efficiency 
on the wire, the bus would look pretty good — but application 
development was a really big pain in the butt.” 
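
The marshalling chore that Van Doren describes looks roughly like this in miniature. The sketch uses a hypothetical singly linked list as the pointer-based structure and a plain list copy to stand in for the transfer across the bus. None of the names come from a real accelerator API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    value: int
    next: Optional["Node"] = None


def flatten(head: Optional[Node]) -> List[int]:
    """Smash a pointer-based list down to a plain array that can be copied."""
    values = []
    while head is not None:
        values.append(head.value)
        head = head.next
    return values


def expand(values: List[int]) -> Optional[Node]:
    """Rebuild the pointer-based list on the other side of the link."""
    head = None
    for value in reversed(values):
        head = Node(value, head)
    return head


def send_to_accelerator(head: Optional[Node]) -> Optional[Node]:
    """Flatten, copy (a list copy stands in for the bus transfer), then re-expand."""
    wire = list(flatten(head))
    return expand(wire)
```

Every pointer-based structure an application shares with the accelerator needs its own flatten and expand pair, which is the development burden the quote complains about.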


And so the industry created unified addressing or shared 
virtual memory, depending on the naming convention 
accelerator makers employed, allowing for virtual addressing 
to be shared across CPUs and accelerators. This made certain 
things easier, but you still have two distinct physical address 
domains and it is not one big pool of writeback memory for 
applications to play in. And while the complexity was hidden 
by doing a lot of copying data back and forth between CPU 
memory and accelerator memory, there was a lot of memory 
management because access was driven by page faulting. 


The obvious fix, according to Intel, is to simply let the CPU 
access pages of memory in the accelerators, and for that you 
need coherency. And with asymmetric coherency, fine-grained 
memory control as is done between CPUs on a 
system is not the goal, but direct memory access is. Van 
Doren says that cache coherency protocols have a “bad habit” 
of leaving data in or pulling it to the wrong place, such as 
leaving it in the CPU cache instead of the accelerator memory 
where it will probably be more useful. The coherence bias 
added into the CXL protocol makes use of the fact that 
memory usage on accelerators, be they memory buffers or 
computational offload, is pretty well defined and can 
simply keep data back in the devices where it will be used 
next, saving CPU cycles that might otherwise be used to do 
the pushing. There is a bias flip and the data is moved from 
the CPU cache to the accelerator memory in one fell swoop. 
This approach, says Van Doren, is much more efficient than 
migrating pages back and forth as is now done. 


All of this explains why all of the big players are joining the 
CXL Consortium, even Intel’s biggest processing and 
accelerator rivals. Everybody just wants for all of this iron to 
get along. 